Why we are here


e To understand the fundamentals of how a GPU works
e To help you understand performance issues
e To share ideas


Overview 


Quick review of the graphics pipeline 

Mapping the graphics pipeline into the gpu blocks 
How a shader core works 

Some real gpu use cases 

Mobile Gpus 

Conclusions 


Rasterization in six slides (I) 


Before we start we need to understand the problem we want to solve 
Turning triangle data into pixels 
Many steps involved 


o Geometry processing 
m Project triangles in screen space 


e Rasterization 
o  Find the pixels covered by the triangle
m Or triangle walking 
e Pixel processing 
o Actually assign a color to the pixel 


Rasterization in six slides (II)


e In the beginning we only have vertex data
e 3d point coordinates


Rasterization in six slides (III) 


e Vertices are transformed and projected into 2d space


e We call this "vertex shading"


Rasterization in six slides (IV) 


e They are then assembled into a primitive 


Rasterization in six slides (V) 


e Then we determine which pixels of the screen the primitive “touches”
e We call these "fragments"


Rasterization in six slides (VI) 


e Finally we need to assign a color to each one of them 
e We call this "Fragment shading" 
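The six slides above can be condensed into a tiny software rasterizer. This is only a sketch under stated assumptions: the pinhole projection, the edge-function coverage test and the names `project`, `edge`, `rasterize` and `shade` are illustrative choices, not how any particular GPU implements these stages.

```python
def project(vertex, width, height):
    """Vertex shading: perspective-project a 3D point to pixel coordinates."""
    x, y, z = vertex
    # Assumed pinhole projection: divide by depth, then map [-1, 1] to the screen.
    ndc_x, ndc_y = x / z, y / z
    return ((ndc_x + 1.0) * 0.5 * width, (1.0 - ndc_y) * 0.5 * height)


def edge(a, b, p):
    """Signed area term telling on which side of the edge a->b the point p lies."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(triangle, width, height):
    """Rasterization: return the pixels (fragments) whose centre the triangle covers."""
    a, b, c = (project(v, width, height) for v in triangle)  # primitive assembly
    fragments = []
    for py in range(height):
        for px in range(width):
            p = (px + 0.5, py + 0.5)  # sample at the pixel centre
            w0, w1, w2 = edge(b, c, p), edge(c, a, p), edge(a, b, p)
            # Covered when all three edge functions agree in sign (either winding).
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0):
                fragments.append((px, py))
    return fragments


def shade(fragment):
    """Fragment shading: a constant colour, the simplest possible pixel shader."""
    return (0.4, 0.4, 0.8, 1.0)


triangle = [(-1.0, -1.0, 1.0), (1.0, -1.0, 1.0), (-1.0, 1.0, 1.0)]
fragments = rasterize(triangle, 8, 8)
framebuffer = {f: shade(f) for f in fragments}
```

A real GPU does not loop over every pixel of the screen; the later slides on coarse rasterization and hierarchical Z describe how that work is avoided.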


Gpu to help 


e A modern gpu can accelerate all of this
o Wasn't always the case in the past, but that's another story
All of the previous operations map to specific HW blocks
Some functionality is programmable and performed by the shader cores
o I.e. fragment shader, vertex shader


e Other functionality is fixed but parameterizable
o I.e. primitive assembly, blending.


Graphics pipeline 


[Figure: the logical graphics pipeline, ending in the output merger; the programmable part is executed by the shader system]


e "Logical" pipeline described in OGL/DX specification 
o It's an abstraction 
e At the physical level things are very different
o As long as specs are met there's no problem
e Today we will look at things from a slightly closer POV


Anatomy of a GPU 


e Extremely parallel machine


o Thousands of “threads” in flight


o But
m Limited flow control
Some threads share a program counter


m No inter-process communication
Extremely good at doing lots of independent operations at the same time


Memory bandwidth is very high
o Hundreds of GB/s
o But
m Very high latency
e Thousands of cycles
Latency hiding mechanisms are necessary


e Graphics pipeline is organized to overcome those constraints


Before all that 


e CPU issues commands to the GPU


o E.g.
m Draw using those vertices and indices
m Set this viewport
m Change states
m May contain constants for shaders
m Blend everything over using this blending function
e Commands are not executed immediately
o Typically the cpu prepares commands for the next frame while the GPU is rendering the current one
m Double buffering
o The commands are written into a command buffer
o The GPU parses them


Introducing Command Buffer Parser 


Parses the command buffer
Sends commands down the graphics pipeline
Synchronization point between cpu and gpu
o Surface sync
m E.g. wait until all the draw commands on a render target have finished before binding it as a texture


e CPU-bound applications: the command buffer is not filled fast enough and the GPU is idle.


Geometry stage 


Lots of sub stages 


e Input assembly 
e Vertex shading 
e Primitive assembly 


Plus optional stuff: 


Domain shader 

Tessellation shader 

Geometry shader 

Stream out 

For simplicity we are skipping those 


Input assembly Unit 


Fetches the indices / vertices from main memory


Has a vertex reuse cache
o  Triangles share vertices, so cache hits are likely.
o A cache miss means the vertex needs to be sent to the shader system to be transformed


e When enough cache misses are accumulated a job is sent to the shader core

e Usually there is more than one input assembly unit in a GPU.


o Work distribution is usually done at drawcall level
m E.g. assign each batch of 128 indices to a different IA unit.
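To see why the vertex reuse cache matters, the sketch below counts how many vertices would have to be sent to the shader system for a given index buffer. The FIFO replacement policy and the 16-entry size are assumptions made for illustration; real hardware may use a different policy and size.

```python
from collections import deque


def count_vertex_shader_runs(indices, cache_size=16):
    """Count the indices that miss a FIFO reuse cache and so must be shaded."""
    cache = deque(maxlen=cache_size)  # the oldest entry is evicted first
    misses = 0
    for index in indices:
        if index not in cache:
            misses += 1          # cache miss: vertex goes to the shader system
            cache.append(index)
    return misses


# Two triangles sharing an edge (a quad): 6 indices but only 4 unique vertices.
quad_indices = [0, 1, 2, 2, 1, 3]
shaded = count_vertex_shader_runs(quad_indices)
```

With a cache that is too small for the access pattern, every index misses and each vertex is transformed again.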


Vertex shading 


e Done by the shader core
o Details later


First stage that is entirely programmable
Exports the position and vertex attributes to forward to pixel shading

e Positions are stored in a position cache, used in primitive assembly/setup

e Attributes are stored in a separate cache; they are needed only in pixel shaders


Primitive assembly Unit 


So far we have only transformed points (vertices)
Primitive assembly takes the positions from the position cache

e Uses the connectivity information we gave in the API (e.g. triangle list)
And turns them into triangles
At this point triangles need to be discarded if outside the view

e Clipped if partially in view

o Clipping produces more triangles and is expensive, so a guard band is used to minimize it


[Figure: a triangle clipped against the viewport]


Primitive assembly Unit 


e Surviving primitives are then perspective projected (divided by w) and viewport transformed

e Backface and zero-area culling happen here
Vertices are "snapped" to pixels
CPU bounding box culling can avoid the PA being overwhelmed
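The perspective divide, viewport transform and backface / zero-area culling can be sketched in a few lines. The conventions here are assumptions for the example: clip-space input, a y-up viewport, and counter-clockwise triangles treated as front facing. The function names are hypothetical.

```python
def to_screen(clip_pos, width, height):
    """Perspective divide (by w) followed by the viewport transform."""
    x, y, _z, w = clip_pos
    ndc_x, ndc_y = x / w, y / w                 # normalized device coordinates
    return ((ndc_x * 0.5 + 0.5) * width, (ndc_y * 0.5 + 0.5) * height)


def signed_area(a, b, c):
    """Positive for counter-clockwise triangles, zero for degenerate ones."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))


def survives_culling(clip_triangle, width, height):
    """Reject backfacing (negative area) and zero-area triangles."""
    a, b, c = (to_screen(v, width, height) for v in clip_triangle)
    return signed_area(a, b, c) > 0.0


front = [(-1.0, -1.0, 0.0, 1.0), (1.0, -1.0, 0.0, 1.0), (0.0, 1.0, 0.0, 1.0)]
back = list(reversed(front))
```

Clipping and the guard band are left out of this sketch.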
Triangle rasterization Unit


e Finds which pixels a triangle covers

e Done in a hierarchical fashion, with at least 2 levels

e There are many rasterizers in a GPU, each one serving a portion of the screen

e Performs hierarchical Z and early Z

e Assembles quads (2x2 pixels)

e When enough quads are accumulated a job is sent to the shader system


Coarse rasterization 


e Screen is divided into “tiles”

e The triangle is first tested against those tiles

e If the triangle doesn't hit a tile then we save unnecessary tests


[Figure: screen split into 8x8 pixel tiles; tiles the triangle does not touch are not processed]
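A minimal sketch of the coarse level follows. It uses the triangle's bounding box to pick candidate tiles, which is a simplification chosen for brevity: it can keep tiles the triangle never touches, and hardware typically uses a tighter test. `candidate_tiles` is a hypothetical helper name.

```python
import math


def candidate_tiles(triangle, screen_w, screen_h, tile_size=8):
    """Return the (tile_x, tile_y) tiles overlapped by the triangle's bounding box."""
    xs = [x for x, _ in triangle]
    ys = [y for _, y in triangle]
    tiles_x = math.ceil(screen_w / tile_size)
    tiles_y = math.ceil(screen_h / tile_size)
    # Clamp the bounding box to the grid of tiles that exist on screen.
    tx0 = max(0, math.floor(min(xs) / tile_size))
    tx1 = min(tiles_x - 1, math.floor(max(xs) / tile_size))
    ty0 = max(0, math.floor(min(ys) / tile_size))
    ty1 = min(tiles_y - 1, math.floor(max(ys) / tile_size))
    return [(tx, ty) for ty in range(ty0, ty1 + 1) for tx in range(tx0, tx1 + 1)]


# On a 32x32 screen (16 tiles) this triangle only needs fine tests in 6 tiles.
tiles = candidate_tiles([(4, 4), (20, 4), (4, 12)], 32, 32)
```

Only the returned tiles go on to the per-pixel coverage test; everything else is skipped.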


Hierarchical Z unit


Performs early rejection of the primitive


For each tile
o  Keep track of the current min and max z in the tile
o If the triangle min z is larger than the tile max z
o  Reject the triangle
o Otherwise update the min max


Also handles fast z clears

Can skip fragment processing of entire triangles

Remember: triangles are never sorted anywhere; they are processed in submission order!


[Figure: flowchart for a new triangle covering tile X: reject if triangle Z > tile Z, accept otherwise]
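The per-tile test can be sketched as below. It assumes a "less than" depth test with z growing away from the camera, tracks only the tile max z, and tightens it only when a triangle covers the whole tile. That last rule is an assumption added to keep the test conservative; the slides do not spell it out.

```python
class HiZTile:
    """Conservative depth bound for one screen tile."""

    def __init__(self, far=1.0):
        self.max_z = far  # state after a fast z clear

    def test(self, tri_min_z, tri_max_z, covers_whole_tile):
        """Return False when the whole triangle is hidden inside this tile."""
        if tri_min_z > self.max_z:
            return False  # rejected: no fragment work for this tile
        if covers_whole_tile:
            # Every pixel of the tile is now at least this close.
            self.max_z = min(self.max_z, tri_max_z)
        return True


tile = HiZTile()
first = tile.test(0.2, 0.3, covers_whole_tile=True)    # near triangle, accepted
second = tile.test(0.5, 0.6, covers_whole_tile=False)  # behind it, rejected
```

Because triangles arrive in submission order, drawing near geometry first lets this test reject more work.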
Early Z


Similar to HiZ

But done at sample level

Computes the depth of the pixel before its color
The pixel shader is not executed if the test doesn't pass
Not always possible

I.e. translucency, discard
Depth compression


If a triangle touches many pixels it's expensive to store its depth per pixel as a float
Instead you can use 3 floats to identify the whole plane of the triangle.
Example: a tile of 8x8 pixels x 4 bytes per float is 256 bytes of uncompressed depth


o If we use plane compression the best case is 12 bytes (1 triangle covers the whole tile)
o  In this case, after 21 planes it starts to have the same footprint.


e Greatly reduces memory bandwidth for big triangles
o  Small triangles -> more bandwidth
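The numbers on this slide can be checked with a few lines of arithmetic. The sketch ignores any per-pixel bookkeeping needed to say which plane a pixel belongs to, which is an assumption; real schemes carry some overhead.

```python
TILE_PIXELS = 8 * 8
BYTES_PER_FLOAT = 4
FLOATS_PER_PLANE = 3


def uncompressed_tile_bytes():
    """One float of depth per pixel."""
    return TILE_PIXELS * BYTES_PER_FLOAT


def compressed_tile_bytes(num_planes):
    """Three floats per triangle plane stored in the tile."""
    return num_planes * FLOATS_PER_PLANE * BYTES_PER_FLOAT


def max_planes_before_break_even():
    """Largest plane count that still fits in the uncompressed footprint."""
    return uncompressed_tile_bytes() // compressed_tile_bytes(1)
```

With one plane the tile needs 12 bytes instead of 256; from 22 planes on the compressed form is larger than the raw depth values.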
Fragment shading


e Once the rasterizer has packed enough quads, the fragment shader can be dispatched
o Thousands of pixel shading operations can be in flight at the same time in a gpu

e Depending on the architecture, usually 16 or 8 quads are dispatched together

e This task is performed by the shader system

e Quads are needed for the calculation of derivatives


o Derivatives are needed for selecting the mip level
o Mips are needed for better visual quality and performance


Pixel shading: a note about quads 


e A quad may contain only one primitive.
o In case the primitive does not touch all 4 pixels, extra "ghost" pixels are created
o Ghost pixels are created along the edges
Ghost pixels are necessary for derivative calculations
If the triangle is big enough this is not a problem
m We only have ghost pixels along the edges
o  Bottleneck: small triangles will create tons of these threads, lots of overshading.


[Figure: wasted pixels in a quad]
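The cost of ghost pixels is easy to estimate: every 2x2 quad touched by the triangle is shaded in full. The helper below is a hypothetical illustration that takes the set of pixels a triangle covers and returns how many shader threads run and how many of them are wasted.

```python
def quad_overshading(covered_pixels):
    """Return (threads launched, ghost threads) for a set of covered (x, y) pixels."""
    covered = set(covered_pixels)
    # Each pixel belongs to the 2x2 quad whose origin is its coordinates halved.
    quads = {(x // 2, y // 2) for x, y in covered}
    launched = 4 * len(quads)
    return launched, launched - len(covered)


single_pixel = quad_overshading({(0, 0)})
full_quad = quad_overshading({(0, 0), (1, 0), (0, 1), (1, 1)})
```

A triangle covering a single pixel still launches four threads, so three quarters of the shading work is thrown away.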
Texture unit


Shaders usually need to access textures
Requests are performed by a texture unit
A texture unit may serve more than 1 shader core.


If the requested texel is not in the cache
o It is fetched from main memory


o Usually very long latency... thousands of cycles
o Imagine cooking something
o  But you need to walk to a shop a mile away each time you need an ingredient
e Performs texture interpolation


e Does decompression
o Some architectures have a compressed cache, some others don't


Output merger 


Also called Raster ops

Exports pixel color to render target(s)

Writes to main memory

Also performs blend operations

Limited number of pixel operations per clock


Updates the Z buffer
o "Late Z"
o In OpenGL/DirectX specs “Late Z” is the only stage for pixel rejection
o Pixels need to be exported in submission order (DX specs)
o Also Late Z is the only Z rejection system that works if the pixel shader updates z or uses an alpha mask


Compute shaders 


e Used for generic computation
o Not bound to rasterization
o So the command processor sends those commands directly to the shader system


Supports reading/writing textures and buffers, plus atomics
Limited shared local and global storage

Can be asynchronous and run in parallel with graphics work


Shader Core 


The programmable part of the gpu


There are many shader cores in a gpu
o A lot of work can be done in parallel
o Example: GeForce RTX 2080 has 2944 CUDA cores


e Very simple unit compared to a CPU
o In-order execution

No speculation

No branch prediction

But very fast at context switching

Very good at latency hiding


e Multiple ALUs share a program counter


VLIW architecture 


e Very long instruction word
o Example: vliw4 (4-pipe vliw) means each core can execute 4 independent instructions at the same time


Maps very well to simple per-pixel operations (dot, etc)
Doesn't map well to general programming
The compiler needs to statically schedule things onto the vector pipes.

e Not all the pipes can always be used... not very efficient


[Figure: VLIW ALU]
Vliw example from Cayman architecture


Table 4.1 Instruction Slots in an Instruction Group

Entry                                                      Bits  Type
Scalar instruction for ALU.X unit                          64    src.X and dst.X vector-element slot
Scalar instruction for ALU.Y unit                          64    src.Y and dst.Y vector-element slot
Scalar instruction for ALU.Z unit                          64    src.Z and dst.Z vector-element slot
Scalar instruction for ALU.W unit                          64    src.W and dst.W vector-element slot
Scalar instruction for ALU.Trans unit                      64    Transcendental slot
X, Y elements of literal constant (X is the first dword)   64    Constant slot
Z, W elements of literal constant (Z is the first dword)   64    Constant slot


Moving to Scalar architecture 


e A single instruction runs across a vector of data
e It means you don't need to vectorize your code


o The scheduler organizes the vectors of data the instruction needs to run on


e Example: summing 8 floats is equivalent to summing 2 float4
e Concept similar to loop unrolling on CPU


[Figure: scalar ALU]
Fictional Shader Core


16 scalar alus, 16 instructions in parallel
Each alu can do one 32-bit floating point instruction in one cycle

e One program counter


o The same instruction is performed 16 times over 16 different data streams


e The register file is big enough for the 16 alus to work in parallel and perform context switching
e Each register has 16 slots, one for each thread
Fictional Shader Core -latency hiding-


Cores need to access data in memory

Accessing memory requires several hundred cycles

During this period the alus have nothing to do

However, if the register file is big enough to contain multiple contexts, the alus can switch to another thread
e If there are enough contexts, the alus will not be idle.
Fictional Shader Core -latency hiding- 


Example 

Alus are processing the C1 group of threads 
At some point there is a dependency stall 

It then switches to C2 

Eventually C2 will stall too 

Alus can switch to C3 
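The C1/C2/C3 example can be turned into a toy simulation. Everything here is an assumption made for illustration: each context does 10 cycles of ALU work, then stalls for 90 cycles on memory, and the core always runs the first context that is ready. The function name is hypothetical.

```python
def alu_utilization(num_contexts, alu_cycles=10, stall_cycles=90, total_cycles=10_000):
    """Fraction of cycles in which the alus have a ready context to run."""
    ready_at = [0] * num_contexts            # cycle at which each context can run again
    work_left = [alu_cycles] * num_contexts  # ALU cycles before the next memory stall
    busy_cycles = 0
    for cycle in range(total_cycles):
        for ctx in range(num_contexts):
            if ready_at[ctx] <= cycle:
                busy_cycles += 1
                work_left[ctx] -= 1
                if work_left[ctx] == 0:
                    # Dependency stall: this context waits for memory,
                    # the alus move on to the next ready context.
                    work_left[ctx] = alu_cycles
                    ready_at[ctx] = cycle + 1 + stall_cycles
                break
    return busy_cycles / total_cycles
```

With these numbers one context keeps the alus busy 10% of the time, five contexts 50%, and ten contexts are enough to hide the stall completely.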
Fictional Shader Core - occupancy-


The register file is dynamically partitioned

"Big" shaders require many registers

And this will affect the number of concurrent contexts
Space for only one context? No latency hiding :(


Achieving the maximum # of contexts is not fundamental
o Usually memory is the bottleneck first
o  But it needs to be high enough to hide latency
Example: a shader takes 100 registers, the register file is 10kb
o 10240 bytes / (16 alus * 4 bytes (32bit)) = 160 registers
o  160 / 100 = 1.6 contexts :(
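The occupancy example works out as follows. This is a sketch for the fictional core only; `max_contexts` is a hypothetical helper, and a fractional context cannot run, so the result is rounded down.

```python
def max_contexts(register_file_bytes, alus, registers_per_thread, bytes_per_register=4):
    """Number of whole contexts that fit in the register file."""
    # Each register needs one 32-bit slot per alu (one per thread).
    registers_available = register_file_bytes // (alus * bytes_per_register)
    return registers_available // registers_per_thread


heavy_shader = max_contexts(10240, alus=16, registers_per_thread=100)
light_shader = max_contexts(10240, alus=16, registers_per_thread=40)
```

The 100-register shader leaves room for a single context, so there is nothing to switch to during a stall, while a 40-register shader fits four.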
Fictional Gpu


16 cores
Each core has 16 alus
256 operations in parallel over 16 different instruction streams
Clocked @ 1 GHz = 256 gigaflops.
Gpu stages are executed in parallel
As soon as a triangle is transformed it is rasterized

e Each core can deal independently with pixel shaders, vertex shaders, compute and so on
Shader system use case : GCN


[Figure: GCN compute unit block diagram. Legible labels: input data (PC/state/vector register/scalar register), instruction fetch and arbitration, message bus, SIMD0, integer ALU, vector ALU, 8KB and 64KB register files, 32KB instruction L1 shared by 4 CUs]


Wavefronts 


The smallest unit of work in gcn is a wavefront
A wavefront is a group of 64 threads
A thread is a single "instance" of the shader that works across only one data path / lane
e Example:
void main()
{
    gl_FragColor = vec4(0.4, 0.4, 0.8, 1.0);
}


e A wavefront is 64 pixels' worth of work
e A thread is 1 pixel inside a wavefront
VGPR / SGPR


e A vgpr is a register that has 64 32-bit entries
o Imagine it as uint32 vgpr[64];
e An operation that takes vgpr operands will happen on all 64 entries simultaneously
e An SGPR instead is a register with a single 32-bit entry


o Useful for values that are constant across all the threads of a wavefront, wavefront status flags and so on
example


void main()
{
    gl_FragColor = vec4(0.4, 0.4, 0.8, 1.0);
}


v_mov_b32 v0, 0x36663666
v_mov_b32 v1, 0x3c003a66
exp mrt0, v0, v0, v1, v1 done compr vm


Move 0.4 into vgpr v0 (two packed 16-bit floats: 0.4, 0.4)

Move 0.8 into vgpr v1 (packed together with 1.0 for alpha)

Export the pixel as v0 v0 v1

All of this happens 64 times simultaneously inside the CU
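The two hexadecimal constants in the listing are pairs of 16-bit floats packed into one 32-bit register. The sketch below decodes them with Python's `struct` module; the helper name `unpack_half2` is made up for this example, and the low 16 bits are assumed to hold the first component.

```python
import struct


def unpack_half2(dword):
    """Split a 32-bit value into two IEEE half floats: (low 16 bits, high 16 bits)."""
    low, high = struct.unpack("<2e", struct.pack("<I", dword))
    return low, high


v0 = unpack_half2(0x36663666)  # red and green
v1 = unpack_half2(0x3C003A66)  # blue and alpha
```

Both halves of `v0` decode to roughly 0.4, while `v1` holds roughly 0.8 in the low half and exactly 1.0 in the high half, matching `vec4(0.4, 0.4, 0.8, 1.0)` within half-float precision.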
CU


Smallest computational unit
A gpu contains many CUs
A CU contains 4 simd units
o Each simd can execute an instruction on 16 different data elements (simd16)
A scalar unit
A branch unit
256kb for vector registers
o 256kb / 4 simd / 64 lanes / 4 bytes = 256 vgprs per simd


e 8kb for scalar registers
o 8kb / 4 bytes / 4 simd = 512 sgprs per simd
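The register counts follow from the sizes once the 4-byte width of each entry is taken into account. The sketch assumes a 256 KB vector register file and an 8 KB scalar register file per CU, each split evenly across the 4 SIMDs; the function names are hypothetical.

```python
def vgprs_per_simd(vector_bytes=256 * 1024, simds=4, lanes=64, bytes_per_entry=4):
    """Vector registers available to each SIMD (every vgpr has one entry per lane)."""
    return vector_bytes // simds // lanes // bytes_per_entry


def sgprs_per_simd(scalar_bytes=8 * 1024, simds=4, bytes_per_entry=4):
    """Scalar registers available to each SIMD (one 32-bit entry each)."""
    return scalar_bytes // bytes_per_entry // simds
```

This gives 256 vgprs and 512 sgprs per SIMD.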
CU simd


e Each simd has its own program counter


o Current instruction inside the wavefront


e Each simd can process 16 32-bit values in 1 cycle
o An entire wavefront takes 4 cycles to be processed by a simd


CU simd 


Each simd has an instruction buffer for 10 wavefronts
Maximum of 40 waves in flight per CU

Depending on register usage

Potentially coming from different kernels/shaders
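How register usage limits the waves in flight can be sketched as below. It assumes 256 vgprs and 10 wavefront slots per simd, and it ignores other limits such as sgprs, LDS and allocation granularity, so treat the result as an upper bound. `waves_per_simd` is a hypothetical helper.

```python
def waves_per_simd(vgprs_per_thread, vgpr_budget=256, max_waves=10):
    """Wavefronts one simd can keep in flight for a given vgpr usage."""
    return min(max_waves, vgpr_budget // vgprs_per_thread)


def waves_per_cu(vgprs_per_thread, simds=4):
    """Same limit for the whole CU (4 simds)."""
    return simds * waves_per_simd(vgprs_per_thread)
```

A shader using 24 vgprs reaches the full 40 waves per CU, while one using 128 vgprs is limited to 8.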
CU scalar unit


e Mainly for control flow across the wavefront
o  Ex: if (constant flag) then else

e Constants are taken from a read-only cache

e Also handles interrupts/sync
Scalar operand operations


CU branch unit 


e Handles vector branches
o  Ex: if vgpr > 0 then else


e Handles floating point exceptions
e Sends messages to other units/host cpu


[Figure: CU front end: SIMD0 to SIMD3, each with its own PC and instruction buffer (10 waves), feeding the branch and message unit, vector ALU, vector memory decode and LDS decode]


LDS/GDS 


A CU also has a shared read/write memory of 64kb (Local Data Share)
Used by pixel shaders as storage for interpolants
But fully accessible by the programmer

o Thread Group Memory
Needs to handle atomic operations and thread group synchronization
Example usage: caching texture data across a compute thread group


GDS is shared across all the CUs


[Figure: LDS block diagram: instruction decode, input buffering and request selection, conflict detection and scheduling, memory banks (64KB total), read and write data crossbars, integer atomic units]


Export 


When the program is finished it usually issues an export
Always the case for a pixel shader
It marks the end of the programmable part and passes the data down to the fixed function blocks
o Ex: an export in a pixel shader passes control over to the color block
Vector memory


A CU has an internal 16kb L1 cache for vector memory operations
o Usually texture data


L2 is outside the CU


Figure 6: Cache Hierarchy

[Figure: each compute unit has a 16KB L1 vector data cache; a 16KB L1 scalar data cache and a 32KB L1 instruction cache sit alongside; a crossbar connects them and the command processors to the L2 cache controllers, each with 64-128KB of read/write memory]
Shader system use case: Nvidia Turing (TU102)


[Figure: TU102 block diagram, with the PCI Express 3.0 host interface at the top]
Streaming multiprocessor (SM)


An SM contains 4 groups of 32 cores

Each group has its own instruction buffer, warp scheduler and register file

4 texture units / L1 cache and 64kb shader memory

Each group has 16 floating point units, 16 integer units and two tensor cores
Each SM has a dedicated ray tracing unit for traversal / intersection


[Figure: SM block diagram: four partitions, each with a warp scheduler + dispatch (32 threads/clk) and a register file (16,384 x 32-bit)]
Warps


A warp is a group of 32 threads and is the smallest unit of work
Each SM can hold 64 warps in flight

Each thread can access a maximum of 255 registers

Register usage determines the actual number of concurrent threads
Graphics processing cluster (GPC)


e Each GPC has:
e A rasterizer (raster engine)
o Turns triangle data into actual pixels
o Ready to be dispatched as warps
o Performs triangle and z culling

e 6 Texture Processor Clusters (TPC)
o 2 SMs each
o A polymorph engine
o Performs vertex fetching and assembles vertex warps


[Figure: GPC block diagram: one raster engine and six TPCs, each TPC with a PolyMorph engine and its SMs]
GeForce RTX 2080


A GPU is an extremely parallel machine

o Each of the stages is executed as soon as there's enough work to do.
This is called immediate mode rendering
The shader system can have any kind of work in flight at a given time.

o While the rasterizers, PA, IA and output merger are processing other things.

o Data dependency is a limiting factor.


Knowing what happens in each stage can help spot the bottleneck.
Radeon gpu profile


About mobile GPUs 


Problem on mobile 


Battery consumption is king
A very high bandwidth memory system is power demanding
Low bandwidth is “slow”
o 10x slower than desktop
e Solution: tile based / tile based deferred architectures
o  TBR/TBDR for short
Tile Based architecture (I)


Use of a high speed on-chip cache.


Used as temporary storage during vertex/pixel shading.


Main memory can be low bandwidth.
Only "invoked" when writing the final pixel data.
And for texture access.

[Figure: what we want on screen]


Tile Based architecture (II)


e To maximize cache usage the screen is divided into tiles


e The screen is rendered one tile at a time


e The tile cache is used as temporary storage for the framebuffer

e When finished the content of the tile is written back to memory


Tile Based architecture (III)


We need to sort all the triangles into tiles
We need to pre-process all the geometry first
All vertex shaders run first
Then we know the triangles per tile
o “Binning”
Binning happens in main memory
o  Based on the principle that “normally” there are fewer triangles than pixels
On a desktop GPU pixel and vertex shaders run in parallel
Tile Based Deferred architecture


Once all triangles are sorted into tiles pixel processing can start
Since we know all the primitives in a tile we can pick only the ones that contribute to the pixel color

o Example: the nearest one
If this happens the architecture is said to be “deferred”
Great reduction of pixel shading work
Only runs shaders that actually write a pixel
On a desktop GPU this is possible with a "Z prepass"


o  But you need to submit the geometry twice
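The saving from deferring can be illustrated by counting pixel shader runs at a single pixel. The sketch assumes opaque geometry, a "less than" depth test, and an immediate renderer that relies on early Z only; the function names are hypothetical.

```python
def shaded_immediate(depths_in_submission_order):
    """Shader runs with early Z: every fragment nearer than what came before is shaded."""
    nearest = float("inf")
    shaded = 0
    for z in depths_in_submission_order:
        if z < nearest:  # early Z passes, so the pixel shader runs
            nearest = z
            shaded += 1
    return shaded


def shaded_deferred(depths_in_submission_order):
    """Shader runs when the nearest fragment is picked before any shading happens."""
    return 1 if depths_in_submission_order else 0


back_to_front = [0.9, 0.5, 0.2]
```

Drawn back to front, the immediate path shades all three fragments while the deferred path shades only the visible one; drawn front to back, both shade a single fragment.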
TBR pros and cons


Pros

o Frame buffer bandwidth is reduced
o Z prepass is “free”
o Tile cache is more efficient than cache lines
o Blending happens in the tile cache
m Programmable blending is possible


Cons

o Splits rendering into two lockstepped stages
o Tile cache limits the usage of frame buffer formats and multiple render targets
o Complex scenes might heavily slow down the binning process
o Harder to read pixels across tiles
Gpus - where are we now


Mobile and desktop are fundamentally different.


Proprietary features to optimize the vertex pipeline

o Nvidia - mesh shaders

o  AMD - primitive shaders

o  VR rendering vertex optimization
e Proprietary features to optimize rasterization

o Native variable rate shading for foveated rendering

o Tile rendering similar to mobile architectures

o Mentioned in the Vega white paper

o Nvidia experiment https://github.com/nlguillemot/trianglebin
e Raytracing

o Really hard problem to solve

o Dynamic BVH creation

o  GPU traversal

o Handling incoherent rays
Conclusions


Basics of both desktop and mobile architectures covered
Highlighted possible bottlenecks for each stage
We can use this knowledge to understand profiling


GPUs have evolved over the years
o Always know your target architecture!